SQL Server 2008 : Full-Text Search Troubleshooting

12/26/2010 4:48:54 PM

Full-Text Search Maintenance

After you create full-text catalogs and indexes that you can query, you have to maintain them. The catalogs and indexes largely maintain themselves, but you still need to back them up and restore them, and to tune your search solution for optimal performance. In SQL Server 2008, the full-text catalogs get fragmented from time to time, especially if you are using the Automatic (Track Changes Automatically) setting. You can check the level of fragmentation by listing the queryable index fragments with the following command:

SELECT table_id, status FROM sys.fulltext_index_fragments WHERE status=4 OR status=6;

If you notice that your tables are highly fragmented, you should optimize your full-text indexes. Here is the command you would use to do this:

ALTER FULLTEXT CATALOG AdventureWorks2008 REORGANIZE;
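As a rough way to gauge fragmentation per table, the queryable fragments can be counted and sized rather than just listed. The following is a minimal sketch, assuming it is run in the database that owns the full-text indexes; the column aliases are illustrative, and how many fragments count as "highly fragmented" is a judgment call for your workload.

```sql
-- Count and size the queryable fragments (status 4 or 6) for each
-- full-text indexed table in the current database.
SELECT OBJECT_NAME(table_id) AS table_name,      -- alias names are illustrative
       COUNT(*)              AS fragment_count,
       SUM(data_size)        AS total_bytes,
       SUM(row_count)        AS total_rows
FROM sys.fulltext_index_fragments
WHERE status IN (4, 6)
GROUP BY table_id
ORDER BY fragment_count DESC;
```

Tables that appear at the top of this list with many fragments are the likeliest candidates for a REORGANIZE of their catalog.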

Full-Text Search Performance

SQL Server FTS performance is most sensitive to the number of rows in the result set and number of search terms in the query. You should limit your result set to a practical number; most searchers are conditioned to look only at the first page of results for what they are looking for, and if they don’t see what they need there, they refine the search and search again. A good practical limit for the number of rows to return is 200. You should try, if at all possible, to use simple queries because they perform better than more complex ones. As a rule, you should use CONTAINS rather than FREETEXT because it offers better performance, and you should use CONTAINSTABLE rather than FREETEXTTABLE for the same reason.
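The advice above can be sketched as a single query that uses CONTAINSTABLE with a simple search condition and caps the result set at 200 rows. The table dbo.Documents and its columns DocumentID, Title, and Body are hypothetical names used only for illustration; substitute your own full-text indexed table and key column.

```sql
-- Hypothetical table: dbo.Documents(DocumentID int key, Title, Body),
-- with a full-text index on Body.
-- The fourth argument of CONTAINSTABLE (top_n_by_rank) limits the
-- full-text engine to the 200 highest-ranked matches.
SELECT d.DocumentID,
       d.Title,
       ft.[RANK]
FROM CONTAINSTABLE(dbo.Documents, Body, N'"catalog" AND "maintenance"', 200) AS ft
JOIN dbo.Documents AS d
    ON d.DocumentID = ft.[KEY]
ORDER BY ft.[RANK] DESC;
```

Limiting the rows inside CONTAINSTABLE, rather than with TOP on the outer query alone, lets the full-text engine stop ranking early instead of returning every match to the relational engine.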

Several factors are involved in delivering an optimal Full-Text Search solution. Consider the following:

Avoid indexing binary content. Convert it to text, if possible. Most IFilters do not perform as well as the text IFilter.
Use an integer column on the base table for the unique index that serves as the full-text key.
Partition large tables into smaller tables. There seems to be a sweet spot around 50 million rows, but your results may vary. Ensure that for large tables, each table has its own catalog. Place this catalog on a RAID 10 array, preferably on its own controller.
SQL Full-Text Search benefits from multiple processors, preferably four or more. A sweet spot exists on eight-way machines or better. You will find 64-bit hardware also offers substantial performance benefits over 32-bit.
Dedicate at least 512MB to 1GB of RAM to MSFTESQL by setting the maximum server memory to 1GB less than the installed memory. Set resource usage to run at 5 to give a performance boost to the indexing process (that is, sp_fulltext_service 'resource_usage',5), set ft crawl bandwidth (max) and ft notify bandwidth (max) to 0, and set max full-text crawl range to the number of CPUs on your system. Use sp_configure to make these changes.
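The settings in the last item above might be applied as follows. This is a sketch under stated assumptions, not a recommended configuration: it assumes a server with 16GB of installed memory and four CPUs, so the memory and crawl range values are placeholders to adjust for your hardware. The resource_usage call is included because the text recommends it; it may have no effect on some versions.

```sql
-- These options are advanced options, so expose them first.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Assumption: 16GB installed (16384MB), leaving 1GB outside SQL Server.
EXEC sp_configure 'max server memory (MB)', 15360;

-- Remove the caps on crawl and notification buffers.
EXEC sp_configure 'ft crawl bandwidth (max)', 0;
EXEC sp_configure 'ft notify bandwidth (max)', 0;

-- Assumption: four CPUs on this system.
EXEC sp_configure 'max full-text crawl range', 4;
RECONFIGURE;

-- Set with sp_fulltext_service rather than sp_configure.
EXEC sp_fulltext_service 'resource_usage', 5;
```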

Full-Text Search Troubleshooting

The first question you should ask yourself when you have a problem with SQL Full-Text Search is this: “Is the problem with searching or with indexing?” To help you make this determination, Microsoft has included three DMVs in SQL Server 2008:

sys.dm_fts_index_keywords
sys.dm_fts_index_keywords_by_document
sys.dm_fts_parser

The first two DMVs display the contents of your full-text index. The first DMV returns the following columns:

Keyword— Each keyword in varbinary form.
Display_term— The keyword as indexed; all the accents are removed from the word.
Column_ID— The column ID where the word exists.
Document_Count— The number of documents (rows) in which the word occurs in that column.

The second DMV breaks down the keywords by document. Like the first DMV, it contains the Keyword, Display_term, and Column_ID columns, but in addition it contains the following two columns:

Document_ID— The row in which the keyword occurs.
Occurrence_count— The number of times the word occurs in the cell (a cell is also known as a tuple; it is a row-column combination—for example, the contents of the third column in the fifth row).

The first DMV, sys.dm_fts_index_keywords, is used primarily to determine candidate noise words, and it can also be used to diagnose indexing problems. The second DMV, sys.dm_fts_index_keywords_by_document, is used to determine what is stored in your index for a particular cell.

Here are some examples of their usage:

select * from sys.dm_fts_index_keywords(DB_ID(), OBJECT_ID('MyTable'))
select * from sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID('MyTable'))
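In practice, narrower queries tend to be more useful than select *. The sketch below shows one way to look for candidate noise words and to inspect a single row. MyTable comes from the examples above; the document ID of 5 and the TOP (20) cutoff are arbitrary values chosen for illustration, and the filter on 'END OF FILE' assumes the index reports its end-of-data marker under that display term.

```sql
-- Candidate noise words: terms that appear in the most rows.
SELECT TOP (20) display_term, column_id, document_count
FROM sys.dm_fts_index_keywords(DB_ID(), OBJECT_ID('MyTable'))
WHERE display_term <> 'END OF FILE'   -- assumed end-of-data marker
ORDER BY document_count DESC;

-- What the index holds for one row (document ID 5 is an arbitrary example).
SELECT display_term, column_id, occurrence_count
FROM sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID('MyTable'))
WHERE document_id = 5
ORDER BY occurrence_count DESC;
```

If a word you expect to find is missing from the second result, the problem is more likely on the indexing side than the searching side.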

These two DMVs are used to determine what occurs at index time. The third DMV, sys.dm_fts_parser, is used primarily to determine what happens at search time—in other words, how SQL Server Full-Text Search interprets your search phrase. Here is an example of its usage.

select * from sys.dm_fts_parser(@QueryString, @LCID, @StopListID, @AccentSensitive)

@QueryString is your search word or phrase, @LCID is the locale ID for your language (determinable by querying sys.fulltext_languages), @StopListID is the ID of your stoplist (determinable by querying sys.fulltext_stoplists), and @AccentSensitive sets accent sensitivity (0 for not sensitive, 1 for sensitive to accents). Here is an example of how this works:

select * from sys.dm_fts_parser('café', 1033, 0, 1)
select * from sys.dm_fts_parser('café', 1033, 0, 0)

In the second example, you will notice that the Display_term is cafe and not café. These queries return the following columns:

Keyword— This is a varbinary representation of your keyword.
Group_id— The query parser builds a parse tree of the search phrase. If you have any Boolean searches, it assigns different group IDs to each part of the search term. For example, in the search phrase '"Hillary Clinton" OR "Barack Obama"', Hillary and Clinton belong to Group ID 1, and Barack and Obama belong to Group ID 2.
Phrase_id— Some words are indexed in multiple forms; for example, data-base is indexed as data, base, and database. In this case, data and base have the same phrase ID, and database has another phrase ID.
Occurrence— This is the position of the term in the parsed result. For example, in the phrase 'SQL Server query processor', SQL has an occurrence of 1, Server has 2, and so on.
Special_term— This column refers to any delimiters that the parser finds in the search phrase. Possible values are Exact Match, End of Sentence, End of Paragraph, and End of Chapter.
Display_term— This is how the term would be stored in the index.
Expansion_type— This is the type of expansion, whether it is a thesaurus expansion (4), an inflectional expansion (2), or not expanded (0). For example, the following query shows the stemmed variants of the word run.
```
select * from sys.dm_fts_parser('FORMSOF( INFLECTIONAL, run)', 1033, 0, 0)
```
Source_Term— This is the source term as it appears in your query.
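To see these columns together, the Boolean phrase from the Group_id description can be run through the parser. This is a small sketch that assumes US English (LCID 1033), the system stoplist (ID 0), and accent-insensitive parsing; the two lookup queries show where the LCID and stoplist ID values can be found.

```sql
-- Look up the available language and stoplist IDs.
SELECT lcid, name FROM sys.fulltext_languages;
SELECT stoplist_id, name FROM sys.fulltext_stoplists;

-- Parse a Boolean search phrase and inspect how it is grouped.
-- Note that the position column is named occurrence in the DMV output.
SELECT group_id, phrase_id, occurrence, special_term,
       display_term, expansion_type, source_term
FROM sys.dm_fts_parser(N'"Hillary Clinton" OR "Barack Obama"', 1033, 0, 0);
```

Comparing display_term with source_term row by row is a quick way to check whether the word breaker, stoplist, or accent setting is changing your search terms before they reach the index.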

When troubleshooting indexing problems, you should consult the full-text error log, which can be found in C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\LOG and starts with the prefix SQLFT followed by the database ID (padded with leading zeros), the catalog ID (query sys.fulltext_catalogs for this value), and then the extension .log. You may find many versions of the log each with a numerical extension, such as SQLFT0001800005.LOG.4; this is the fourth version of this log. These full-text indexing logs can be read by any text editor.

You might find entries in this log that indicate documents were retried or failed indexing, in addition to error messages returned from the IFilters.
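To work out which log file belongs to which catalog, the file name can be composed from the database ID and the catalog ID. The sketch below assumes both IDs are padded to five digits, which is inferred only from the SQLFT0001800005.LOG example above; verify the result against the files actually present in your LOG directory.

```sql
-- Build the expected full-text log file name for each catalog
-- in the current database.
-- Assumption: database ID and catalog ID are each zero-padded to 5 digits.
SELECT name AS catalog_name,
       'SQLFT'
       + RIGHT('00000' + CAST(DB_ID() AS varchar(5)), 5)
       + RIGHT('00000' + CAST(fulltext_catalog_id AS varchar(5)), 5)
       + '.LOG' AS expected_log_file
FROM sys.fulltext_catalogs;
```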